Method for extracting information from a database

ABSTRACT

A method operates on a database to extract and present information to a user. The database comprises data tables containing values of a number of variables. The information is to be extracted by evaluating at least one mathematical function which operates on one or more selected calculation variables. The presented information is to be partitioned on one or more selected classification variables. The method comprises the steps of identifying all boundary tables; identifying all connecting tables; electing a starting table among said boundary and connecting tables; building a conversion structure that links values of each selected variable in the boundary tables to corresponding values of one or more connecting variables in the starting table; and evaluating the mathematical function for each data record of the starting table, by using the conversion structure, such that the evaluation yields a final data structure containing a result of the mathematical function for every unique value of each classification variable.

TECHNICAL FIELD

The present invention relates to a method for extracting informationfrom a database. The database comprises a number of data tablescontaining values of a number of variables, each data table consistingof at least one data record including at least two of said values. Theinformation is extracted by evaluation of at least one mathematicalfunction, which operates on one or more selected calculation variables.Further, the extracted information is partitioned on one or moreselected classification variables.

BACKGROUND OF THE INVENTION

It is often desired to extract specific information from a database thatis stored on a secondary memory of a computer. More specifically, thereis need to summarise a large amount of data in the database, and presentthe summarised data to a user in a lucid way. For example, a user mightbe interested in extracting total sales per year and client from adatabase including transaction data for a large company. Thus, theextraction involves evaluation of a mathematical function, e.g. asummation (“SUM(x*y)”), operating on a combination of calculationvariables (x, y), e.g. the number of sold items (“Number”) and the priceper item (“Price”). The extraction also involves partitioning theinformation according to classification variables, e.g. “Year” and“Client”. Thus, the classification variables define how the result ofthe mathematical operation should be presented. In this specific case,the extraction of the total sales per year by client would involveevaluation of “SUM(Number*Price) per Year, Client”.

In one prior-art solution, a computer program is designed to process thedatabase and to evaluate all conceivable mathematical functionsoperating on all conceivable calculation variables partitioned on allconceivable classification variables, also called dimensions. The resultof this operation is a large data structure commonly known as amultidimensional cube. This multidimensional cube is obtained through avery time-consuming operation, which typically is performed over-night.The cube contains the evaluated results of the mathematical functionsfor every unique combination of the occurring values of theclassification variables. The user can then, in a different computerprogram operating on the multidimensional cube, explore the data of thedatabase, for example by visualising selected data in pivot tables orgraphically in 2D and 3D charts. When the user defines a mathematicalfunction and one or more classification variables, all otherclassification variables are eliminated through a summation over theresults stored in the cube for this mathematical function, the summationbeing made for all other classification variables. Thus, by adding orremoving classification variables, the user can move up or down in thedimensions of the cube.

This approach has some undesired limitations. If the multidimensionalcube after evaluation contains average quantities, e.g. the averagesales partitioned on a number of classification variables, the usercannot eliminate one or more of these classification variables since asummation over average quantities does not yield a correct totalaverage. In this case, the multidimensional cube must contain theaverage quantity split on every conceivable combination ofclassification variables as well, adding an extra complexity to theoperation of building the multidimensional cube. The same problem arisesfor other quantities, e.g. median values.

Often it is difficult to predict all relevant mathematical functions,calculation variables and classification variables before making a firstexamination of the data in the database. Upon identifying trends andpatterns, the user might find a need to add a function or a variable toreach underlying details in the data. Then, the time-consuming procedureof building a new multidimensional cube must be initiated.

SUMMARY OF THE INVENTION

Accordingly, the object of the present invention is to mitigate theabove-mentioned drawbacks and to provide a method for extractinginformation from a database, which method allows the user to freelyselect mathematical functions and incorporate calculation variables inthese functions as well as to freely select classification variables forpresentation of the results.

This object is achieved by a method having the features recited inindependent claim 1. Preferred embodiments are recited in the dependentclaims.

According to the present invention there is provided a method forgenerating a final data structure, i.e. a multidimensional cube, fromdata in a database in an efficient way, with respect to both processtime and memory requirement. Since the cube can be generated much fasterthan in prior-art solutions, it is possible to generate multidimensionalcubes ad hoc. The user can interactively define and generate a cubewithout being limited to a very small number of mathematical functionsand variables. The mathematical function is normally composed of acombination of mathematical expressions. If the user needs to modify themathematical function by changing, adding or removing a mathematicalexpression, a new cube can normally be generated in a time short enoughnot to disturb the user in his work. Similarly, if the user desires toadd or remove a variable, the cube can be rapidly regenerated.

This is achieved by a clever grouping of all relevant data tables intoboundary tables and connecting tables, respectively, based on the typeof variables included in each table. By electing one of these tables asa starting point and by building an appropriate conversion structure,the final data structure can be efficiently generated from the startingtable by use of the conversion structure.

Preferably, the data records of the database are first read into theprimary memory of a computer so that the data can be processed off-line.This will further reduce the time for searching the database andgenerating the final data structure. The database may be stored on asecondary memory or be a remotely stored database to which the computeris connected by a modem. It is to be understood that the database thusread into the primary memory may be a selected part of a larger databaseor a combination of two or more databases.

In one preferred embodiment, each different value of each data variableis assigned a binary code and the data records are stored inbinary-coded form. On account of the binary coding, very rapid searchescan be conducted in the data tables. Moreover, redundant information canbe removed, resulting in a reduced amount of data.

In another preferred embodiment, all boundary and connecting tables thatinclude calculation variables with a need for frequency data, i.e.variables for which the number of duplicates of each value is necessaryfor correct evaluation of the mathematical function, define a subset. Byelecting the starting table from this subset and by including frequencydata in the conversion structure, memory-efficient storage of duplicatescan be achieved when building the final data structure.

In the conversion structure, the frequency data can be included byduplication of each value, i.e. the conversion structure will contain alink from each value of a connecting variable in the starting table to acorrect number of each value of at least one corresponding selectedvariable in a boundary table. Alternatively, a counter may be includedin the conversion structure for each unique value of each connectingvariable in the starting table.

Preferably, the boundary or connecting table having the largest numberof data records is elected as starting table. This tends to minimise theamount of frequency data that must be incorporated in the conversionstructure, which therefore can be built more rapidly.

In a further preferred embodiment, a virtual data record is created byreading a data record of the starting table and by using the conversionstructure to convert each value of each connecting variable in this datarecord into a value of at least one corresponding selected variable.Thereby, the virtual data record will contain a current combination ofvalues of the selected variables. The final data structure can begradually built by sequentially reading data records from the startingtable, by updating the content of the virtual data record based on thecontent of each such data record, and by evaluating the mathematicalfunction based on the content of each such updated virtual data record.This procedure minimises the amount of computer memory that is neededfor extracting the requested information from the database. Further,virtual data records containing undefined values, so-called NULL values,of any calculation variable can often be immediately removed, inparticular when all calculation variables exhibit NULL-values, since inmany cases such values can not be used in the evaluation of themathematical function. This feature will contribute to an optimisedperformance.

In another preferred embodiment, an intermediate data structure is builtbased on the content of the virtual data record. Each data record of theintermediate data structure contains a field for each selectedclassification variable and an aggregation field for each mathematicalexpression included in the mathematical function. For each updatedvirtual data record, each mathematical expression is evaluated and theresult is aggregated in the appropriate aggregation field based on thecurrent value of each selected classification variable. Such anintermediate data structure allows the user to combine mathematicalexpressions with different need for frequency data in one mathematicalfunction. By building several conversion structures incorporatingcorresponding frequency data, and by evaluating the data records of astarting table for each such mathematical expression based on acorresponding conversion structure, it is possible to merge the resultsof these evaluations in one intermediate data structure. Likewise, ifthe user modifies the mathematical function by adding a new mathematicalexpression operating on the already selected calculation variables, itis only necessary to add an aggregation field to the existingintermediate data structure, or to extend an existing aggregation field.

It should be noted that the virtual data record in general is indeedvirtual, i.e. it is not physically allocated any memory, during thetransition from a data record of the starting table to the final datastructure. However, such a virtual data record can always, at leastimplicitly, be identified in the procedure of converting the content ofa data record of the starting table into current values of the selectedvariables.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects and advantages of the present invention willbecome more apparent and more readily appreciated to those skilled inthe art upon reading the following description of the preferredembodiments, taken in conjunction with the accompanying drawings,wherein:

FIG. 1 shows exemplary Tables 1–5.

FIG. 2 shows a block flow chart of an exemplary method for extractinginformation from a database.

FIG. 3 shows exemplary Tables 6–12.

FIG. 4 shows exemplary Tables 13–16.

FIG. 5 shows exemplary Tables 17, 18, and 20–23.

FIG. 6 shows exemplary Tables 24–29.

DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention will now be described by way of examples,reference being made to FIGS. 1–6 of the drawings, FIG. 1 showing thecontent of a database after identification of relevant data tablesaccording to the inventive method, FIG. 2 showing a sequence of steps ofan embodiment of the method according to the invention, and FIGS. 3–6showing exemplary data tables.

A database, as shown in FIG. 1, comprises a number of data tables(Tables 1–5). Each data table contains data values of a number of datavariables. For example, in Table 1 each data record contains data valuesof the data variables “Product”, “Price” and “Part”. If there is nospecific value in a field of the data record, this field is consideredto hold a NULL-value. Similarly, in Table 2 each data record containsvalues of the variables “Date”, “Client”, “Product” and “Number”.Typically, the data values are stored in the form of ASCII-codedstrings.

The method according to the present invention is implemented by means ofa computer program. In a first step (step 101), the program reads alldata records in the database, for instance using a SELECT statementwhich selects all the tables of the database, i.e. Tables 1–5 in thiscase. Typically, the database is read into the primary memory of thecomputer.

To increase the evaluation speed, it is preferred that each unique valueof each data variable in said database is assigned a different binarycode and that the data records are stored in binary-coded form (step101). This is typically done when the program first reads the datarecords from the database. For each input table, the following steps arecarried out. First the column names, i.e. the variables, of the tableare successively read. Every time a new data variable appears, a datastructure is instantiated for it. Then, an internal table structure isinstantiated to contain all the data records in binary form, whereuponthe data records are successively read and binary-coded. For each datavalue, the data structure of the corresponding data variable is checkedto establish if the value has previously been assigned a binary code. Ifso, that binary code is inserted in the proper place in theabove-mentioned table structure. If not, the data value is added to thedata structure and assigned a new binary code, preferably the next onein ascending order, before being inserted in the table structure. Inother words, for each data variable, a unique binary code is assigned toeach unique data value.

Tables 6–12 of FIG. 3 show the binary codes assigned to different datavalues of some data variables that are included in the database of FIG.1.

After having read all data records in the database, the program analysesthe database to identify all connections between the data tables (step102). A connection between two data tables means that these data tableshave one variable in common. Different algorithms for performing such ananalysis are known in the art. After the analysis all data tables arevirtually connected. In FIG. 1, such virtual connections are illustratedby double-ended arrows (a). The virtually connected data tables shouldform at least one so-called snowflake structure, i.e. a branching datastructure in which there is one and only one connecting path between anytwo data tables in the database. Thus, a snowflake structure does notcontain any loops. If loops do occur among the virtually connected datatables, e.g. if two tables have more than one variable in common, asnowflake structure can in some cases still be formed by means ofspecial algorithms known in the art for resolving such loops.

After this initial analysis, the user can start to explore the database.In doing so, the user defines a mathematical function, which could be acombination of mathematical expressions (step 103). Assume that the userwants to extract the total sales per year and client from the databasein FIG. 1. The user defines a corresponding mathematical function“SUM(x*y)”, and selects the calculation variables to be included in thisfunction: “Price” and “Number”. The user also selects the classificationvariables: “Client” and “Year”.

The computer program then identifies all relevant data tables (step104), i.e. all data tables containing any one of the selectedcalculation and classification variables, such data tables being denotedboundary tables, as well as all intermediate data tables in theconnecting path(s) between these boundary tables in the snowflakestructure, such data tables being denoted connecting tables. For thesake of clarity, the group of relevant data tables (Tables 1–3) isincluded in a first frame (A) in FIG. 1. Evidently, there are noconnecting tables in this particular case.

In the present case, all occurrences of every value, i.e. frequencydata, of the selected calculation variables must be included forevaluation of the mathematical function. In FIG. 1, the selectedvariables (“Price”, “Number”) requiring such frequency data areindicated by bold arrows (b), whereas remaining selected variables areindicated by dotted lines (b′). Now, a subset (B) can be defined thatincludes all boundary tables (Tables 1–2) containing such calculationvariables and any connecting tables between such boundary tables in thesnowflake structure. It should be noted that the frequency requirementof a particular variable is determined by the mathematical expression inwhich it is included. Determination of an average or a median calls forfrequency information. In general, the same is true for determination ofa sum, whereas determination of a maximum or a minimum does not requirefrequency data of the calculation variables. It can also be noted thatclassification variables in general do not require frequency data.

Then, a starting table is elected, preferably among the data tableswithin subset (B), most preferably the data table with the largestnumber of data records in this subset (step 105). In FIG. 1, Table 2 iselected as the starting table. Thus, the starting table containsselected variables (“Client”, “Number”), and connecting variables(“Date”, “Product”). These connecting variables link the starting table(Table 2) to the boundary tables (Tables 1 and 3).

Thereafter, a conversion structure is built (step 106), as shown inTables 13 and 14 of FIG. 4. This conversion structure is used fortranslating each value of each connecting variable (“Date”, “Product”)in the starting table (Table 2) into a value of a corresponding selectedvariable (“Year”, “Price”) in the boundary tables (Table 3 and 1,respectively). Table 13 is built by successively reading data records ofTable 3 and creating a link between each unique value of the connectingvariable (“Date”) and a corresponding value of the selected variable(“Year”). It can be noted that there is no link from value 4(“Date:1999-01-12”), since this value is not included in the boundarytable. Similarly, Table 14 is built by successively reading data recordsof Table 1 and creating a link between each unique value of theconnecting variable (“Product”) and a corresponding value of theselected variable (“Price”). In this case, value 2 (“Product:Toothpaste”) is linked to two values of the selected variable (“Price:6.5”), since this connection occurs twice in the boundary table. Thus,frequency data is included in the conversion structure. Also note thatthere is no link from value 3 (“Product: Shampoo”).

When the conversion structure has been built, a virtual data record iscreated. Such a virtual data record, as shown in Table 15, accommodatesall selected variables (“Client”, “Year”, “Price”, “Number”) in thedatabase. In building the virtual data record (step 107–108), a datarecord is first read from the starting table (Table 2). Then, the valueof each selected variable (“Client”, “Number”) in the current datarecord of the starting table is incorporated in the virtual data record.Also, by using the conversion structure (Tables 13–14) each value ofeach connecting variable (“Date”, “Product”) in the current data recordof the starting table is converted into a value of a correspondingselected variable (“Year”, “Price”), this value also being incorporatedin the virtual data record.

At this stage (step 109), the virtual data record is used to build anintermediate data structure (Table 16). Each data record of theintermediate data structure accommodates each selected classificationvariable (dimension) and an aggregation field for each mathematicalexpression implied by the mathematical function. The intermediate datastructure (Table 16) is built based on the values of the selectedvariables in the virtual data record. Thus, each mathematical expressionis evaluated based on one or more values of one or more relevantcalculation variables in the virtual data record, and the result isaggregated in the appropriate aggregation field based on the combinationof current values of the classification variables (“Client”, “Year”).

The above procedure is repeated for all data records of the startingtable (step 110). Thus, an intermediate data structure is built bysuccessively reading data records of the starting table, byincorporating the current values of the selected variables in a virtualdata record, and by evaluating each mathematical expression based on thecontent of the virtual data record. If the current combination of valuesof classification variables in the virtual data record is new, a newdata record is created in the intermediate data structure to hold theresult of the evaluation. Otherwise, the appropriate data record israpidly found, and the result of the evaluation is aggregated in theaggregation field. Thus, data records are added to the intermediate datastructure as the starting table is traversed. Preferably, theintermediate data structure is a data table associated with an efficientindex system, such as an AVL or a hash structure. In most cases, theaggregation field is implemented as a summation register, in which theresult of the evaluated mathematical expression is accumulated. In somecases, e.g. when evaluating a median, the aggregation field is insteadimplemented to hold all individual results for a unique combination ofvalues of the specified classification variables. It should be notedthat only one virtual data record is needed in the procedure of buildingthe intermediate data structure from the starting table. Thus, thecontent of the virtual data record is updated for each data record ofthe starting table. This will minimise the memory requirement inexecuting the computer program.

The procedure of building the intermediate data structure will befurther described with reference to Tables 15–16. In creating the firstvirtual data record R1, as shown in Table 15, the values of the selectedvariables “Client” and “Number” are directly taken from the first datarecord of the starting table (Table 2). Then, the value “1999-01-02” ofthe connecting variable “Date” is transferred into the value “1999” ofthe selected variable “Year”, by means of the conversion structure(Table 13). Similarly, the value “Toothpaste” of the connecting variable“Product” is transferred into the value “6.5” of the selected variable“Price” by means of the conversion structure (Table 14), thereby formingthe virtual data record R1. Then, a data record is created in theintermediate data structure, as shown in Table 16. In this case, theintermediate data structure has tree columns, two of which holdsselected classification variables (“Client”, “Year”). The third columnholds an aggregation field, in which the evaluated result of themathematical expression (“x*y”) operating on the selected calculationvariables (“Number”, “Price”) is aggregated. In evaluating virtual datarecord R1, the current values (binary codes: 0,0) of the classificationvariables are first read and incorporated in this data record of theintermediate data structure. Then, the current values (binary codes:2,0) of the calculation variables are read. The mathematical expressionis evaluated for these values and added to the associated aggregationfield.

Next, the virtual data record is updated based on the starting table.Since the conversion structure (Table 14) indicates a duplicate of thevalue “6.5” of the selected variable “Price” for the value “Toothpaste”of the connecting variable “Product”, the updated virtual data record R2is unchanged and identical to R1. Then, the virtual data record R2 isevaluated as described above. In this case, the intermediate datastructure contains a data record corresponding to the current values(binary codes: 0,0) of the classification variables. Thus, the evaluatedresult of the mathematical expression is accumulated in the associatedaggregation field.

Next, the virtual data record is updated based on the second data recordof starting table. In evaluating this updated virtual data record R3, anew data record is created in the intermediate data structure, and soon.

It should be noted that NULL values are represented by a binary code of−2 in this example. In the illustrated example, it should also be notedthat any virtual data records holding a NULL value (−2) of any one ofthe calculation variables can be directly eliminated, since NULL valuescan not be evaluated in the mathematical expression (“x*y”). It shouldalso be noted that all NULL values (−2) of the classification variablesare treated as any other valid value and are placed in the intermediatedata structure.

After traversing the starting table, the intermediate data structurecontains four data records, each including a unique combination ofvalues (0,0; 1,0; 2,0; 3,−2) of the classification variables, and thecorresponding accumulated result (41; 37.5; 60; 75) of the evaluatedmathematical expression.

Preferably, the intermediate data structure is also processed toeliminate one or more classification variables (dimensions). Preferably,this is done during the process of building the intermediate datastructure, as described above. Every time a virtual data record isevaluated, additional data records are created, or found if they alreadyexist, in the intermediate data structure. Each of these additional datarecords is destined to hold an aggregation of the evaluated result ofthe mathematical expression for all values of one or more classificationvariables. Thus, when the starting table has been traversed, theintermediate data structure will contain both the aggregated results forall unique combinations of values of the classification variables, andthe aggregated results after elimination of each relevant classificationvariable.

This procedure of eliminating dimensions in the intermediate datastructure will be further described with reference to Tables 15 and 16.When virtual data record R1 is evaluated (Table 15) and the first datarecord (0,0) is created in the intermediate data structure, additionaldata records are created in this structure. Such additional data recordsare destined to hold the corresponding results when one or moredimensions are eliminated. In Table 16, a classification variable isassigned a binary code of −1 in the intermediate data structure todenote that all values of this variable are evaluated. In this case,three additional data records are created, each holding a newcombination of values (−1,0; 0,−1; −1,−1) of the classificationvariables. The evaluated result is aggregated in the associatedaggregation field of these additional data records. The first (−1,0) ofthese additional data records is destined to hold the aggregated resultfor all values of the classification variable “Client” when theclassification variable “Year” has the value “1999”. The second (0,−1)additional data record is destined to hold the aggregated result for allvalues of the classification variable “Year” when the classificationvariable “Client” is “Nisse”. The third (−1,−1) additional data recordis destined to hold the aggregated result for all values of bothclassification variables “Client” and “Year”.

When virtual data record R2 is evaluated, the result is aggregated inthe aggregation field associated with the current combination of values(binary codes: 0,0) of the classification variables, as well as in theaggregation fields associated with relevant additional data records(binary codes: −1,0; 0,−1; −1,−1). When virtual data record R3 isevaluated, the result is aggregated in the aggregation field associatedwith the current combination of values (binary codes: 1,0) of theclassification variables. The result is also aggregated in theaggregation field of a newly created additional data record (binarycodes: 1,−1) and in the aggregation fields associated with relevantexisting data records (binary codes: −1,0; −1,−1) in the intermediatedata structure.

After traversing the starting table, the intermediate data structurecontains eleven data records, as shown in Table 16.

Preferably, if the intermediate data structure accommodates more thantwo classification variables, the intermediate data structure will, foreach eliminated classification variable, contain the evaluated resultsaggregated over all values of this classification variable for eachunique combination of values of remaining classification variables.

When the intermediate data structure has been built, a final datastructure, i.e. a multidimensional cube, as shown in non-binary notationin Table 17 of FIG. 5, is created by evaluating the mathematicalfunction (“SUM (x*y)”) based on the results of the mathematicalexpression (“x*y”) contained in the intermediate data structure (step111). In doing so, the results in the aggregation fields for each uniquecombination of values of the classification variables are combined. Inthe illustrated case, the creation of the final data structure isstraightforward, due to the trivial nature of the present mathematicalfunction. The content of the final data structure might then (step 112)be presented to the user in a two-dimensional table, as shown in Table18 of FIG. 5. Alternatively, if the final data structure contains manydimensions, the data can be presented in a pivot table, in which theuser interactively can move up and down in dimensions, as is well knownin the art.

Below, a second example of the inventive method is described withreference to Tables 20–29 of FIGS. 5–6. The description will onlyelaborate on certain aspects of this example, namely building aconversion structure including data from connecting tables, and buildingan intermediate data structure for a more complicated mathematicalfunction. In this example, the user wants to extract sales data perclient from a database, which contains the data tables shown in Tables20–23 of FIG. 5. For ease of interpretation, the binary coding isomitted in this example.

The user has specified the following mathematical functions, for whichthe result should be partitioned per Client:

-   -   a) “IF(Only(Environment index)=‘I’) THEN Sum(Number*Price)*2,        ELSE Sum(Number*Price))”, and    -   b) “Avg(Number*Price)”

The mathematical function (a) specifies that the sales figures should bedoubled for products that belong to a product group having anenvironment index of ‘I’, while the actual sales figures should be usedfor other products. The mathematical function (b) has been included forreference.

In this case, the selected classification variables are “Environmentindex” and “Client”, and the selected calculation variables are “Number”and “Price”. Tables 20, 22 and 23 are identified as boundary tables,whereas Table 21 is identified as a connecting table. Table 20 iselected as starting table. Thus, the starting table contains selectedvariables (“Number”, “Client”), and a connecting variable (“Product”).The connecting variable links the starting table (Table 20) to theboundary tables (Tables 22–23), via the connecting table (Table 21).

Next, the formation of the conversion structure will be described withreference to Tables 24–26 of FIG. 6. A first part (Table 24) of theconversion structure is built by successively reading data records of afirst boundary table (Table 23) and creating a link between each uniquevalue of the connecting variable (“Product group”) and a correspondingvalue of the selected variable (“Environment index”). Similarly, asecond part (Table 25) of the conversion structure is built bysuccessively reading data records of a second boundary table (Table 22)and creating a link between each unique value of the connecting variable(“Price group”) and a corresponding value of the selected variable(“Price”). Then, data records of the connecting table (Table 21) areread successively. Each value of the connecting variables (“Productgroup” and “Price group”, respectively) in Tables 24 and 25 issubstituted for a corresponding value of a connecting variable(“Product”) in Table 21. The result is merged in one final conversionstructure, as shown in Table 26.

Then, an intermediate data structure is built by successively readingdata records of the starting table (Table 20), by using the conversionstructure (Table 26) to incorporate the current values of the selectedvariables (“Environment index”, “Client”, “Number”, “Price”) in thevirtual data record, and by evaluating each mathematical expressionbased on the current content of the virtual data record.

For reasons of clarity, Table 27 displays the corresponding content ofthe virtual data record for each data record of the starting table. Asnoted in connection with the first example, only one virtual data recordis needed. The content of this virtual data record is updated, i.e.replaced, for each data record of the starting table.

Each data record of the intermediate data structure, as shown in Table28, accommodates a value of each selected classification variable(“Client”, “Environment index”) and an aggregation field for eachmathematical expression implied by the mathematical functions. In thiscase, the intermediate data structure contains two aggregation fields.One aggregation field contains the aggregated result of the mathematicalexpression (“x*y”) operating on the selected calculation variables(“Number”, “Price”), as well as a counter of the number of suchoperations. The layout of this aggregation field is given by the factthat an average quantity should be calculated (“Avg(x*y)”). The otheraggregation field is designed to hold the lowest and highest values ofthe classification variable “Environment index” for each combination ofvalues of the classification variables.

As in the first example, the intermediate data structure (Table 28) isbuilt by evaluating the mathematical expression for the current contentof the virtual data record (each row in Table 27), and by aggregatingthe result in the appropriate aggregation field based on the combinationof current values of the classification variables (“Client”,“Environment index”). The intermediate data structure also includes datarecords in which the value “<ALL>” has been assigned to one or both ofthe classification variables. The corresponding aggregation fieldscontain the aggregated result when the one or more classificationvariables (dimensions) are eliminated.

When the intermediate data structure has been built, a final datastructure, i.e. a multidimensional cube, is created by evaluating themathematical functions based on the evaluated results of themathematical expressions contained in the intermediate data structure.Each data record of the final data structure, as shown in Table 29,accommodates a value of each selected classification variable (“Client”,“Environment index”) and an aggregation field for each mathematicalfunction selected by the user.

The final data structure is built based on the results in theaggregation fields of the intermediate data structure for each uniquecombination of values of the classification variables. When function (a)is evaluated, by sequentially reading data records of Table 28, theprogram first checks if both values in the last column of Table 28 isequal to ‘I’. If so, the relevant result contained in the firstaggregation field of Table 28 is multiplied by two and stored in Table29. If not, the relevant result contained in the first aggregation fieldof Table 28 is directly stored in Table 29. When function (b) isevaluated, the aggregated result of the mathematical expression (“x*y”)operating on the selected calculation variables (“Number”, “Price”) isdivided by the number of such operations, both of which are stored inthe first aggregation field of Table 28. The result is stored in thesecond aggregation field of Table 29.

Evidently, the present invention allows the user to freely selectmathematical functions and incorporate calculation variables in thesefunctions as well as to freely select classification variables forpresentation of the results.

As an alternative, albeit less memory-efficient, to the illustratedprocedure of building an intermediate data structure based on sequentialdata records from the starting table, it is conceivable to first build aso-called join table. This join table is built by traversing all datarecords of the starting table and, by use of the conversion structure,converting each value of each connecting variable in the starting tableinto a value of at least one corresponding selected variable in aboundary table. Thus, the data records of the join table will containall occurring combinations of values of the selected variables. Then,the intermediate data structure is built based on the content of thejoin table. For each record of the join table, each mathematicalexpression is evaluated and the result is aggregated in the appropriateaggregation field based on the current value of each selectedclassification variable. However, this alternative procedure requiresmore computer memory to extract the requested information.

It should be realised that the mathematical function could containmathematical expressions having different, and conflicting, needs forfrequency data. In this case, steps 104–110 (FIG. 2) are repeated foreach such mathematical expression, and the results are stored in onecommon intermediate data structure. Alternatively, one final datastructure, i.e. multidimensional cube, could be built for eachmathematical expression, the contents of these cubes being fused duringpresentation to the user.

1. A computer implemented method to facilitate data fusion for buildinga final multi-dimensional cube data structure by extracting informationfrom a database, which comprises a number of data tables containingvalues of a number of variables, each data table consisting of at leastone data record including at least two of said values, said informationbeing extracted by evaluation of at least one mathematical functionoperating on one or more selected calculation variables, said extractedinformation being partitioned on one or more selected classificationvariables, characterized by the steps of: initially assigning adifferent binary code to each unique value of each data variable in saiddatabase and storing the data records in binary-coded form in createdinternal table structures; identifying all data tables containing atleast one value of one of said selected calculation or classificationvariables, such data tables being boundary tables; identifying all datatables that, directly or indirectly, have variables in common with saidboundary tables and connect the same, such data tables being connectingtables; electing a starting table among said boundary and connectingtables; building a conversion structure that links values of eachselected variable in said boundary tables to corresponding values of oneor more connecting variables in said starting table; evaluating saidmathematical function for each data record of said starting table, byusing said conversion structure to convert each value of each connectingvariable into at least one value of at least one corresponding selectedvariable; and generating said final data structure based on saidevaluation, said final data structure containing an extracted result ofsaid mathematical function for every unique value of each classificationvariable.
 2. A method as set forth in claim 1, characterized by thefurther step of presenting relevant parts of said resulting datastructure to the user in human-readable form.
 3. A method as set forthin claim 1, characterized by the further step of initially reading saiddata records of said database into the primary memory of a computer. 4.A method as set forth in claim 1, characterized by the further steps ofinitially identifying all data tables in said database that havevariables in common, and assigning virtual connections between such datatables, thereby creating a database with a snowflake structure, whereinsaid connecting tables are located between said boundary tables in saidsnowflake structure.
 5. A method as set forth in claim 1, characterizedby the further steps of identifying all calculation variables for whichthe number of occurrences of each value is necessary for correctevaluation of said mathematical function, defining a subset of datatables consisting of boundary tables containing such variables and datatables connecting such boundary tables, electing said starting tablefrom said subset, and including data on said number of occurrences ofeach value in said conversion structure.
 6. A method as set forth inclaim 1, characterized in that said starting table is the data tableamong said boundary and connecting tables having the largest number ofdata records.
 7. A method as set forth in claim 1, characterized by thefurther step of building said final data structure, which includes anumber of data records, each of which contains a field for each selectedclassification variable and an aggregation field for said mathematicalfunction, wherein said building step includes sequentially reading adata record of said starting table, creating a current combination ofvalues of said selected variables by using said conversion structure toconvert each value of each connecting variable in said data record intoa value of at least one corresponding selected variable, evaluating saidmathematical function for said current combination of values, andaggregating the result of said evaluation in the appropriate aggregationfield based on the current value of each selected classificationvariable.
 8. A method as set forth in claim 1, characterized by thefurther step of creating a virtual data record containing a combinationof values of said selected variables, wherein said creating stepincludes reading a data record of said starting table and using saidconversion structure to convert each value of each connecting variablein said data record into a value of at least one corresponding selectedvariable.
 9. A method as set forth in claim 8, characterized by thefurther step of building said final data structure which includes anumber of data records, each of which contains a field for each selectedclassification variable and an aggregation field for said mathematicalfunction, wherein said building step includes sequentially reading adata record of said starting table, updating the content of said virtualdata record based on the content of each such data record, evaluatingsaid mathematical function based on said updated virtual data record,and aggregating the result of said evaluation in the appropriateaggregation field based on the current value of each selectedclassification variable in said updated virtual data record.
 10. Amethod as set forth in claim 8, characterized by the further step ofbuilding an intermediate data structure which includes a number of datarecords, each of which contains a field for each selected classificationvariable and an aggregation field for each mathematical expressionimplied by said mathematical function, wherein said building stepincludes sequentially reading a data record of said starting table,updating the content of said virtual data record based on the content ofeach such data record, evaluating each mathematical expression based onsaid updated virtual data record, and aggregating the result of saidevaluation in an appropriate aggregation field based on the currentvalue of each selected classification variable in said updated virtualdata record.
 11. A method as set forth in claim 10, characterized inthat said step of building said intermediate data structure includes:eliminating one of said classification variables in said intermediatedata structure by aggregating said results over all values of said oneclassification variable for each unique combination of values ofremaining classification variables, by creating additional data records,and by incorporating said aggregated results in said additional datarecords of said intermediate data structure.
 12. A method as set forthin claim 10, characterized by the further step of evaluating saidmathematical function based on said results in said aggregation fieldsfor each unique combination of values of said classification variables,thereby building said final data structure.
 13. A method as set forth inclaim 1, characterized in that said step of building said conversionstructure includes: a) reading data records of a boundary table, andcreating a conversion structure including a link between each uniquevalue of at least one connecting variable in said boundary table andeach corresponding value of at least one selected variable therein; b)processing through said boundary table towards said starting table; c)if a connecting table is found, reading a data records of saidconnecting table, and substituting each unique value of said at leastone connecting variable in said conversion structure for at least onecorresponding unique value of at least one connecting variable in saidconnecting table; and d) repeating steps (b)–(c) until said startingtable is found.
 14. An article of manufacture comprising a computerreadable medium having stored thereon a computer program for effectingthe steps of a method for extracting information from a database as setforth in claim 1.